bayroot: A Bayesian phylogenetic approach to dating HIV reservoir sequences

Roux-Cil Ferreira and Art Poon

Department of Pathology and Laboratory Medicine, Western University, Canada

The latent HIV reservoir

  • HIV provirus in the latent reservoir may come from an early or late stage of infection.
    • Adaptation to host-specific immune response, drug resistance.
  • Is long-term persistence of the reservoir due to clonal expansion or ongoing replication?
    • Replication in anatomic compartments with low drug penetrance would replenish a “young” reservoir.

Dating HIV integration events

  • Replication-competent provirus from the latent reservoir can be sequenced from viral outgrowth experiments.
  • One approach is to build a phylogeny and map outgrowth sequences to dated plasma HIV RNA.
  • A tree is an incomplete sample of lineages, uncertainty in topology and location of root.
Image source: Abrahams et al. (2019) The replication-competent HIV-1 latent reservoir is primarily established near the time of therapy initiation. Science Trans Med 11(513).

Root-to-tip regression

  • Under a strict molecular clock, the expected number of mutations increases linearly with time.
  • A root-to-tip (RTT) method regresses divergence from the root (\(y\)-axis) on sample collection dates (\(x\)-axis).
  • Fit regression to HIV RNA only, use model to predict integration dates.
  • May be more robust to uncertainty in tree.
Image source: Buonagurio DA, et al.. Evolution of human influenza A viruses over 50 years: rapid, uniform rate of change in NS gene. Science. 1986 May 23;232(4753):980-2.

Problems with RTT dating

  • Ignores variation in number of mutations — there is one, and only one, estimated date.
  • Unclear how to deal with divergent outgrowth sequences.
  • Proviral sequences can be mapped to the future (right, red zone).
  • Assumes location of root and clock rate are known without error.

A simple Bayesian model

  • The RTT regression model has three parameters:

    1. The date of the origin (\(x\)-intercept);
    2. The molecular clock (rate/slope, \(\mu\));
    3. The location of the root in the tree (determines divergence, \(y\))
  • We will sample these parameters from the posterior distribution.

    • Continue to assume the unrooted tree is known without error (constrains positioning of root).

Poisson likelihood

  • A strict clock assumes that mutations accumulate at a constant rate \(\mu\) over time, so variation is equal to the mean (Poisson).
    • Let \(Y_i\) be the number of mutations in the \(i^{\textrm{th}}\) sequence, given location of root.
    • Let \(\Delta t_i\) be the time elapsed between the \(i^{\textrm{th}}\) sample and the root.
  • The log-likelihood for RNA set \(\{Y_i, \Delta t_i\}\) is:

\[\log L(Y_i, \Delta t_i) = \sum_i Y_i\log(\mu \Delta t_i) - \mu \Delta t_i - \log \Gamma(Y_i+1)\]

Metropolis-Hastings sampling

Parameter Prior Proposal
Root Uniform \(\textrm{Unif}(-\delta, +\delta)\), reflection on tips and random choice of branches at splits.
Clock rate Lognormal \(\textrm{Unif}(-\delta, +\delta)\) proposal reflecting on zero.
Origin date Uniform Truncated normal proposal with mean 0 and variance \(\sigma\).

Sample run

Step 180 Step 19980

Sampling integration dates

  • Using Bayes’ rule, the probability of integration time \(t_i\) for outgrowth sequence with divergence \(Y_i\) is: \[P(t_i|Y_i) = \frac{P(Y_i|t_i) P(t_i)}{P(Y_i)}\]

  • We assume \(P(t_i) = \frac{1}{T-t_0}\), where \(t_0\) is the origin date and \(T\) is the maximum integration date, and letting \(s=t-t_0\):

\[ P(Y_i) = \frac{\int_0^{T-t_0} (\mu s)^{Y_i} \exp(-\mu s) \mathrm{d}s}{(T-t_0)\Gamma(Y_i+1)} = \frac{\gamma(Y_i+1, \mu(T-t_0))}{\mu(T-t_0)\Gamma(Y_i+1)} \]

where \(\gamma(s, x)\) is the lower incomplete gamma function.

Sampling integration dates (2)

  • Letting \(\Lambda=\mu(T-t_0)\), the probability of \(t_i\) given \(y_i\) mutations simplifies to: \[ P(t_i | Y_i) = \frac{\mu \Lambda^{y_i}\exp(-\Lambda)}{\gamma(Y_i+1, \Lambda)} \]

  • Since we can’t solve for the inverse CDF, we use a simple rejection method to sample integration times.

    • \(t_0\), \(y_i\) and \(\mu\) are sampled from the posterior distribution.

Simulation model

  • We used an exact stochastic simulation module in R (twt) to simulate cell population dynamics (forward time).
  • Discrete events with exponential waiting times determine population dynamics over time.

Simulating population dynamics

  • Branching events require a “source” cell to induce a “target” cell to change state.
    • Transmission of virus from an infected cell to a susceptible cell.
    • Virus replication completely blocked by ART initiation at time \(t=10\).
  • Transition events occur when a cell switches from one state (compartment) to another.
    • e.g., reactivation of a latently-infected cell.

Simulating trees

  • From the population size trajectories, twt samples trees in reverse time for given sampling times and cell types.
    • ▲ = branching event (infection from active cell, or clonal expansion of latent cell)
    • ○ = state transition (from active to latent or vice versa)
  • Retaining these tree annotations lets us collapse branches associated with latently-infected cells.

Simulating evolution

  • Ran trees from twt through INDELible
    • TN93 nucleotide substitution model with 40% As.
    • Seeded with an HIV-1 subtype C sequence at the root (AY772699)
  • Reconstructed trees from simulated alignments with FastTree2 (double precision build).

Comparing bayroot to root-to-tip regression

When clock signal is strong, advantage of bayroot is largely due to prior information about ART initiation.

Integration date estimates for one replicate simulation. Paired Wilcoxon sign-rank test, \(P=3.55\times 10^{-4}\), \(n=50\).

… and with greater uncertainty

Integration date estimates for one replicate simulation. Paired Wilcoxon sign-rank test, \(P=3.82\times 10^{-7}\), \(n=50\).

Results with a narrow prior on root date; with a less informative prior, bayroot results become similar to RTT.

Application to real data

ZM1044M, Zambia-Emory HIV Research Project

Brooks et al., 2020, PLOS Pathog 16(6): e1008378.

Further work

  • A simple and efficient means of incorporating prior knowledge and managing uncertainty in dating the HIV reservoir.
  • Characterize sensitivity of bayroot to varying model parameters, e.g., transition of active infected T-cell to resting state.
  • R package available from https://github.com/PoonLab/bayroot
This study was supported by funding from CIHR and NIH (REACH AI164565-01). RC was supported by an Ontario Genomics-CANSSI Fellowship in Genome Data Science.